pandas官方入门教程:10 Minutes to pandas
下面的操作依赖于这些包,所以操作之前要引入这些包
|
|
1. Object Creation 创建对象
将数组转换为
Series
对象1234567891011In [4]: s = pd.Series([1,3,5,np.nan,6,8])In [5]: sOut[5]:0 1.01 3.02 5.03 NaN4 6.05 8.0dtype: float64将
numpy
数组转换为DataFrame
-
创建一个
datetime index
,起始时间为20170320,时间跨度为7天,默认的freq
为'D'
。如果没有periods
参数则必须指定结束时间end
。
1234567891011121314151617181920In [6]: dates = pd.date_range('20170320', periods=7)In [7]: datesOut[7]:DatetimeIndex(['2017-03-20', '2017-03-21', '2017-03-22', '2017-03-23','2017-03-24', '2017-03-25', '2017-03-26', '2017-03-27'],dtype='datetime64[ns]', freq='D')In [8]: dates = pd.date_range('20170320','20170327')In [9]: datesOut[9]:DatetimeIndex(['2017-03-20', '2017-03-21', '2017-03-22', '2017-03-23','2017-03-24', '2017-03-25', '2017-03-26', '2017-03-27'],dtype='datetime64[ns]', freq='D')# 更改freqIn[10]: dates = pd.date_range('20170320', periods=4,freq='w-mon')In[11]: datesOut[11]:DatetimeIndex(['2017-03-20', '2017-03-27', '2017-04-03', '2017-04-10'], dtype='datetime64[ns]', freq='W-MON')-
np.random.randn(4,4)
生成一个numpy
4*4的数组,其中的每一个数服从正态分布。 index=dates 将dates指定为index。
columns指定列名称
123456789In[86]: df = pd.DataFrame(np.random.randn(4,4), index=dates, columns=list('ABCD'))In[87]: dfOut[87]:A B C D2017-03-20 -0.273931 1.464139 0.511461 0.7070712017-03-27 0.290527 0.799572 0.526141 0.6964592017-04-03 0.390172 0.171379 -1.321742 0.9413582017-04-10 0.963824 -1.089391 0.133069 1.007356-
将字典对象转换成
DataFrame
1234567891011121314In[88]: df2 = pd.DataFrame({ 'A' : 1.,....: 'B' : pd.Timestamp('20130102'),....: 'C' : pd.Series(1,index=list(range(4)),dtype='float32'),....: 'D' : np.array([3] * 4,dtype='int32'),....: 'E' : pd.Categorical(["test","train","test","train"]),....: 'F' : 'foo' })In[89]: df2Out[89]:A B C D E F0 1.0 2013-01-02 1.0 3 test foo1 1.0 2013-01-02 1.0 3 train foo2 1.0 2013-01-02 1.0 3 test foo3 1.0 2013-01-02 1.0 3 train foo
2. Viewing Data 查看数据
dtypes 查看各列的数据类型
1234567891011121314In[94]: df2.dtypesOut[94]:A float64B int64C float32D int32E categoryF objectdtype: objectIn[96]: df['A'].dtypeOut[96]:dtype('float64')head(),tail(),查看数据框头几行和最后几行的数据
1234567891011121314In[97]: df.head()Out[97]:A B C D2017-03-20 -0.273931 1.464139 0.511461 0.7070712017-03-27 0.290527 0.799572 0.526141 0.6964592017-04-03 0.390172 0.171379 -1.321742 0.9413582017-04-10 0.963824 -1.089391 0.133069 1.007356In[98]: df.tail(2)Out[98]:A B C D2017-04-03 0.390172 0.171379 -1.321742 0.9413582017-04-10 0.963824 -1.089391 0.133069 1.007356查看索引df.index
123In[99]: df.indexOut[99]:DatetimeIndex(['2017-03-20', '2017-03-27', '2017-04-03', '2017-04-10'], dtype='datetime64[ns]', freq='W-MON')查看列df.colums
123In[100]: df.columnsOut[100]:Index(['A', 'B', 'C', 'D'], dtype='object')查看这个DataFrame下的numpy data:df.values
1234567In[101]: df.valuesOut[101]:array([[-0.27393096, 1.46413903, 0.5114613 , 0.70707094],[ 0.29052687, 0.79957181, 0.52614137, 0.69645894],[ 0.39017244, 0.17137885, -1.32174227, 0.9413581 ],[ 0.9638237 , -1.08939135, 0.13306936, 1.00735588]])df.describe:给出这个数据框的statistic summary
123456789101112In[102]: df.describe()Out[102]:A B C Dcount 4.000000 4.000000 4.000000 4.000000mean 0.342648 0.336425 -0.037768 0.838061std 0.506953 1.087265 0.875104 0.159730min -0.273931 -1.089391 -1.321742 0.69645925% 0.149412 -0.143814 -0.230634 0.70441850% 0.340350 0.485475 0.322265 0.82421575% 0.533585 0.965714 0.515131 0.957858max 0.963824 1.464139 0.526141 1.007356df.T:转置矩阵,转置后index也随之改变
1234567891011In[103]: df.TOut[103]:2017-03-20 2017-03-27 2017-04-03 2017-04-10A -0.273931 0.290527 0.390172 0.963824B 1.464139 0.799572 0.171379 -1.089391C 0.511461 0.526141 -1.321742 0.133069D 0.707071 0.696459 0.941358 1.007356In[104]: df.T.indexOut[104]:Index(['A', 'B', 'C', 'D'], dtype='object')df.sort_index():根据行列索引排序,默认axis=0,根据索引排序,当axis=1,根据列排序
123456789101112131415161718192021222324252627282930313233In[119]: df.sort_index(ascending=True)Out[119]:A B C D2017-03-20 -0.273931 1.464139 0.511461 0.7070712017-03-27 0.290527 0.799572 0.526141 0.6964592017-04-03 0.390172 0.171379 -1.321742 0.9413582017-04-10 0.963824 -1.089391 0.133069 1.007356In[120]: df.sort_index(ascending=False)Out[120]:A B C D2017-04-10 0.963824 -1.089391 0.133069 1.0073562017-04-03 0.390172 0.171379 -1.321742 0.9413582017-03-27 0.290527 0.799572 0.526141 0.6964592017-03-20 -0.273931 1.464139 0.511461 0.707071In[122]: df.sort_index(axis=1, ascending=False)Out[122]:D C B A2017-03-20 0.707071 0.511461 1.464139 -0.2739312017-03-27 0.696459 0.526141 0.799572 0.2905272017-04-03 0.941358 -1.321742 0.171379 0.3901722017-04-10 1.007356 0.133069 -1.089391 0.963824In[123]: df.sort_index(axis=1, ascending=True)Out[123]:A B C D2017-03-20 -0.273931 1.464139 0.511461 0.7070712017-03-27 0.290527 0.799572 0.526141 0.6964592017-04-03 0.390172 0.171379 -1.321742 0.9413582017-04-10 0.963824 -1.089391 0.133069 1.007356df.sort_values:根据值排序
12345678910111213141516In[124]: df.sort_values(by='A')Out[124]:A B C D2017-03-20 -0.273931 1.464139 0.511461 0.7070712017-03-27 0.290527 0.799572 0.526141 0.6964592017-04-03 0.390172 0.171379 -1.321742 0.9413582017-04-10 0.963824 -1.089391 0.133069 1.007356In[125]: df.sort_values(by='A',ascending=False)Out[125]:A B C D2017-04-10 0.963824 -1.089391 0.133069 1.0073562017-04-03 0.390172 0.171379 -1.321742 0.9413582017-03-27 0.290527 0.799572 0.526141 0.6964592017-03-20 -0.273931 1.464139 0.511461 0.707071
3. Selection
3.1 Getting
df.A,df[‘A’]:将A列的数据取出,两者的作用是相同的
12345678910111213141516In[10]: df.AOut[10]:2017-03-20 0.3877982017-03-27 0.2217802017-04-03 -1.5853172017-04-10 0.035160Freq: W-MON, Name: A, dtype: float64In[11]: df['A']Out[11]:2017-03-20 0.3877982017-03-27 0.2217802017-04-03 -1.5853172017-04-10 0.035160Freq: W-MON, Name: A, dtype: float64按行取数据
123456789101112131415In[12]: df[0:3]Out[12]:A B C D2017-03-20 0.387798 -0.058203 -0.169525 -0.6125712017-03-27 0.221780 -1.170097 0.876672 0.5527942017-04-03 -1.585317 2.253459 0.109537 0.028095In[14]: df['2017-03-20':'2017-04-03']Out[14]:A B C D2017-03-20 0.387798 -0.058203 -0.169525 -0.6125712017-03-27 0.221780 -1.170097 0.876672 0.5527942017-04-03 -1.585317 2.253459 0.109537 0.028095
3.2 Selection by Label
df.loc:按照索引和列取数据
df.at:和df.loc类似,但是只能取一个元素的值,只取一个元素的值,at是更为快速的方法
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455In[16]: df.loc['2017-03-20']Out[16]:A 0.387798B -0.058203C -0.169525D -0.612571Name: 2017-03-20 00:00:00, dtype: float64In[17]: df.loc[dates[0]]Out[17]:A 0.387798B -0.058203C -0.169525D -0.612571Name: 2017-03-20 00:00:00, dtype: float64In[34]: df.loc[:,'A']Out[34]:2017-03-20 0.3877982017-03-27 0.2217802017-04-03 -1.5853172017-04-10 0.035160Freq: W-MON, Name: A, dtype: float64In[36]: df.loc[dates[0:3],'A']Out[36]:2017-03-20 0.3877982017-03-27 0.2217802017-04-03 -1.585317Freq: W-MON, Name: A, dtype: float64In[37]: df.loc[dates[0:3],['A','B']]Out[37]:A B2017-03-20 0.387798 -0.0582032017-03-27 0.221780 -1.1700972017-04-03 -1.585317 2.253459In[39]: df.loc[:,['A','B']]Out[39]:A B2017-03-20 0.387798 -0.0582032017-03-27 0.221780 -1.1700972017-04-03 -1.585317 2.2534592017-04-10 0.035160 0.143292In[74]: df.at[dates[0],'A']Out[74]:0.38779782953047309
3.3 Selection by Position
df.iloc:根据位置取数据,类似于matlab里面的取矩阵的方法
df.iat:和df.iloc类似,但是只能取一个元素的值
123456789101112131415161718192021222324252627282930313233343536373839In[44]: df.iloc[3]Out[44]:A 0.035160B 0.143292C 0.139537D -0.823278Name: 2017-04-10 00:00:00, dtype: float64In[47]: df.iloc[2:4]Out[47]:A B C D2017-04-03 -1.585317 2.253459 0.109537 0.0280952017-04-10 0.035160 0.143292 0.139537 -0.823278In[43]: df.iloc[2:4,0:2]Out[43]:A B2017-04-03 -1.585317 2.2534592017-04-10 0.035160 0.143292In[49]: df.iloc[:,2:4]Out[49]:C D2017-03-20 -0.169525 -0.6125712017-03-27 0.876672 0.5527942017-04-03 0.109537 0.0280952017-04-10 0.139537 -0.823278In[50]: df.iloc[3,2]Out[50]:0.13953667147016041In[75]: df.iat[0,0]Out[75]:0.38779782953047309
3.4 Boolean Indexing
12345678910111213141516In[52]: df[df.A > 0]Out[52]:A B C D2017-03-20 0.387798 -0.058203 -0.169525 -0.6125712017-03-27 0.221780 -1.170097 0.876672 0.5527942017-04-10 0.035160 0.143292 0.139537 -0.823278In[53]: df[df > 0]Out[53]:A B C D2017-03-20 0.387798 NaN NaN NaN2017-03-27 0.221780 NaN 0.876672 0.5527942017-04-03 NaN 2.253459 0.109537 0.0280952017-04-10 0.035160 0.143292 0.139537 NaNisin()
1234567891011121314151617181920212223242526In[67]: df2 = df.copy()In[68]: df2['E'] = ['one', 'one','two','three']In[69]: df2Out[69]:A B C D E2017-03-20 0.387798 -0.058203 -0.169525 -0.612571 one2017-03-27 0.221780 -1.170097 0.876672 0.552794 one2017-04-03 -1.585317 2.253459 0.109537 0.028095 twoIn[70]: df2['E'].isin( ['one', 'one','two'])Out[70]:2017-03-20 True2017-03-27 True2017-04-03 True2017-04-10 FalseFreq: W-MON, Name: E, dtype: boolIn[71]: df2[df2['E'].isin(['one', 'one','two'])]Out[71]:A B C D E2017-03-20 0.387798 -0.058203 -0.169525 -0.612571 one2017-03-27 0.221780 -1.170097 0.876672 0.552794 one2017-04-03 -1.585317 2.253459 0.109537 0.028095 two
4. Missing Data
缺失值:NaN
df.reindex:重置索引,原先不存在的索引所对应的值为NaN
df.dropna():将缺失值删除
df.fillna():填充缺失值
|
|
5. Operations
5.1 Stats
df.mean():默认按照索引取均值
123456789101112131415161718192021222324In[83]: df.mean(0)Out[83]:A -0.235145B 0.292113C 0.239055D -0.213740dtype: float64In[84]: df.mean()Out[84]:A -0.235145B 0.292113C 0.239055D -0.213740dtype: float64In[85]: df.mean(1)Out[85]:2017-03-20 -0.1131252017-03-27 0.1202872017-04-03 0.2014432017-04-10 -0.126322Freq: W-MON, dtype: float64df.sub 减法,df.add 加法
1234567891011121314151617In[112]: df.sub(df['A'],axis=0)Out[112]:A B C D2017-03-20 0.0 -0.446001 -0.557323 -1.0003692017-03-27 0.0 -1.391877 0.654892 0.3310142017-04-03 0.0 3.838775 1.694853 1.6134122017-04-10 0.0 0.108132 0.104377 -0.858438In[113]: df.add(df['A'],axis=0)Out[113]:A B C D2017-03-20 0.775596 0.329595 0.218273 -0.2247732017-03-27 0.443560 -0.948317 1.098452 0.7745742017-04-03 -3.170634 0.668142 -1.475780 -1.5572222017-04-10 0.070320 0.178452 0.174697 -0.788119
5.2 Apply
df.apply():Applying functions to the data
|
|
5.3 Histogramming
value_counts() 计数
|
|
5.4 String Methods
Series.str下面还有很多方法
|
|
6. Merge
6.1 Concat
默认axis=0,按行拼接,axis=1,按列拼接
|
|
6.2 Join
6.3 Append
Append rows to a dataframe. See the Appending
|
|
6.4 Grouping
df.groupby(by=)
|
|
6.5 Reshaping
Stack
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',....: 'foo', 'foo', 'qux', 'qux'],....: ['one', 'two', 'one', 'two',....: 'one', 'two', 'one', 'two']]))tuplesOut[27]:[('bar', 'one'),('bar', 'two'),('baz', 'one'),('baz', 'two'),('foo', 'one'),('foo', 'two'),('qux', 'one'),('qux', 'two')]index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])indexOut[29]:MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],names=['first', 'second'])df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])dfOut[31]:A Bfirst secondbar one 0.398046 0.488732two -0.024753 -0.567505baz one -0.509408 0.055228two 0.936854 0.326829foo one 0.390471 0.540038two -1.001224 0.858414qux one -1.640981 0.103979two -0.167781 0.949106df[:4]Out[32]:A Bfirst secondbar one 0.398046 0.488732two -0.024753 -0.567505baz one -0.509408 0.055228two 0.936854 0.326829stacked = df2.stack()stackedOut[34]:first secondbar one A -1.131337B 0.734982two A 0.008166B -1.358364baz one A -1.578985B 0.399612two A 1.697060B 0.830027dtype: float64stacked.unstack()Out[35]:A Bfirst secondbar one -1.131337 0.734982two 0.008166 -1.358364baz one -1.578985 0.399612two 1.697060 0.830027stacked.unstack().unstack()Out[36]:A Bsecond one two one twofirstbar -1.131337 0.008166 0.734982 -1.358364baz -1.578985 1.697060 0.399612 0.830027stacked.unstack(1)Out[37]:second one twofirstbar A -1.131337 0.008166B 0.734982 -1.358364baz A -1.578985 1.697060B 0.399612 0.830027stacked.unstack(0)Out[38]:first bar bazsecondone A -1.131337 -1.578985B 0.734982 0.399612two A 0.008166 1.697060B -1.358364 0.830027stacked.unstack(2)Out[39]:A Bfirst secondbar one -1.131337 0.734982two 0.008166 -1.358364baz one -1.578985 0.399612two 1.697060 0.830027
6.6 Time Series
resample:对时间序列做类似于groupby的操作
12345678910111213141516171819202122232425In[163]: rng = pd.date_range('1/1/2012', periods=10, freq='S')In[164]: ts = pd.Series(np.random.randn(10), index=rng)In[165]: tsOut[165]:2012-01-01 00:00:00 0.9249842012-01-01 00:00:01 -1.2534912012-01-01 00:00:02 -0.4179122012-01-01 00:00:03 -0.1047312012-01-01 00:00:04 -1.1659802012-01-01 00:00:05 -0.5358972012-01-01 00:00:06 -0.9371042012-01-01 00:00:07 -0.1437322012-01-01 00:00:08 -0.2926962012-01-01 00:00:09 -0.700354Freq: S, dtype: float64In[166]: ts.resample('2S').sum()Out[166]:2012-01-01 00:00:00 -0.3285072012-01-01 00:00:02 -0.5226432012-01-01 00:00:04 -1.7018772012-01-01 00:00:06 -1.0808362012-01-01 00:00:08 -0.993050Freq: 2S, dtype: float64to_period,to_timestamp:Timestamped data can be converted to PeriodIndex-ed data using
to_period
and vice-versa usingto_timestamp
123456789101112131415161718192021222324252627282930313233343536373839404142In[167]: rng = pd.date_range('1/1/2012', periods=5, freq='M')In[168]: ts = pd.Series(np.random.randn(len(rng)), index=rng)In[169]: tsOut[169]:2012-01-31 -0.1744292012-02-29 2.0734882012-03-31 1.4594422012-04-30 0.7031982012-05-31 -1.796758Freq: M, dtype: float64In[175]: ps = ts.to_period()In[176]: psOut[176]:2012-01 -0.1744292012-02 2.0734882012-03 1.4594422012-04 0.7031982012-05 -1.796758Freq: M, dtype: float64In[177]: ps.to_timestamp()Out[177]:2012-01-01 -0.1744292012-02-01 2.0734882012-03-01 1.4594422012-04-01 0.7031982012-05-01 -1.796758Freq: MS, dtype: float64# Remember that ‘s’ and ‘e’ can be used to return the timestamps at the start or end of the period:In[179]: ps.to_timestamp('D', 'e')Out[179]:2012-01-31 -0.1744292012-02-29 2.0734882012-03-31 1.4594422012-04-30 0.7031982012-05-31 -1.796758Freq: M, dtype: float64
7. Getting Data In/Out
7.1 CSV
df.to_csv()
pd.read_csv()
7.2 Json
to_json
read_json
|
|
7.3 Excel
df.to_excel
pd.read_excel
|
|